Abstract

We introduce a novel parser based on a probabilistic version of a left-corner parser. The left-corner strategy is 
attractive because rule probabilities can be conditioned on both top-down goals and bottom-up derivations. We 
develop the underlying theory and explain how a grammar can be induced from analyzed data. We show that the 
left-corner approach provides an advantage over simple top-down probabilistic context-free grammars in parsing 
the Wall Street Journal using a grammar induced from the Penn Treebank. We also conclude that the Penn 
Treebank provides a fairly weak testbed due to the flatness of its bracketings and to the obvious overgeneration and 
undergeneration of its induced grammar. 



1 Introduction 



For context-free grammars (CFGs), there is a well-known standard probabilistic version, Probabilistic Context-Free
Grammars (PCFGs), which have been thoroughly investigated [Suppes, 1970, Sankoff, 1971, Baker, 1979,
Lari and Young, 1990, Kupiec, 1991, Jelinek et al., 1992, Charniak, 1993].

Under this model, one assigns probabilities for different rewrites of a non-terminal. Or in other words, one 
is giving the probability of a local subtree given the mother node. So, for example, we might have: 



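$$P(\mathrm{NP} \to \mathrm{Pron} \mid \mathrm{NP\ mother}) = 0.1$$

$$P(\mathrm{NP} \to \mathrm{Det\ N} \mid \mathrm{NP\ mother}) = 0.2$$

where in general, $\forall$ nonterminals $A$, $\sum_{\gamma} P(A \to \gamma) = 1$.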

But standard PCFGs are only one way to make a probabilistic version of CFGs. If we think in parsing 
terms, a PCFG corresponds to a probabilistic version of top down parsing, since at each stage we are trying 
to predict the child nodes given knowledge only of the parent node. Other parsing methods lend themselves to 
different models of probabilistic conditioning. Usually, such conditioning is a mixture of top-down and bottom- 
up information. This paper discusses some initial results from another point in this parameter space where the 
conditioning reflects a left-corner parsing strategy, yielding what we will call probabilistic left-corner grammars 
(PLCGs).^1 Left-corner parsers simultaneously work top-down from a goal category and bottom-up from the

* We thank Edward Stabler and Mark Johnson for getting us interested in left corner parsing, and Mark more particularly for
valuable discussion of some of the work in this paper.

1 This name may appear strange since the symbolic part of the grammar is unchanged and still context-free. But if we regard 
the probabilistic conditioning as part of the grammar, then we do have a different kind of grammar. We should then perhaps call 
the result a LCPG, but we place the P in initial position for reasons of tradition. 



Expansion          % as Subj    % as Obj
NP -> PRP            13.7%        2.1%
NP -> NNP             3.5%        0.9%
NP -> DT NN           5.6%        4.6%
NP -> NN              1.4%        2.8%
NP -> NP SBAR         0.5%        2.6%
NP -> NP PP           5.6%       14.1%

Table 1: Selected common expansions of NP as Subject vs. Object



left corner of a particular rule. For instance, a rule such as S -> NP VP has the left-corner NP and will be
fired whenever an NP has been derived and an S would help toward the eventual goal category. In this paper, 
we present algorithms for PLCG parsing, present some results comparing PLCG parsing with PCFG parsing, 
and discuss some mechanisms for improving results. 

Why might one want to employ PLCGs? While the main perceived weakness of PCFGs is their lack
of lexicalization, they are also deficient on purely structural grounds [Briscoe and Carroll, 1993]. Inherent to
the idea of a PCFG is that probabilities are context-free: for instance, that the probability of a noun phrase
expanding in a certain way is independent of where the NP is in the tree. Even if we in some way lexicalize
PCFGs to remove the other deficiency, this assumption of structural context-freeness remains. But this
context-free assumption is actually quite wrong. For example, Table 1 shows how the probabilities of expanding an NP
node (in the Penn Treebank) differ wildly between subject position and object position. Pronouns, proper names
and definite NPs appear more commonly in subject position while NPs containing post-head modifiers and bare
nouns occur more commonly in object position (this reflects the fact that the subject normally expresses the
sentence-internal topic [Manning, 1996]).

Another advantage of PLCGs is that parse probabilities are straightforwardly calculated from left to right, 
which is convenient for online processing and integration with other linear probabilistic models.^2



2 Probabilistic Left Corner Grammars 



Left corner parsers [Rosenkrantz and Lewis II, 1970, Demers, 1977] work by a combination of bottom-up and
top-down processing. One begins with a goal category (the root of what is currently being constructed), and
then looks at the left corner of the string (i.e., one shifts the next terminal). If the left corner is the same
category as the goal category, then one can stop. Otherwise, one projects a possible local tree from the left
corner. The remaining children of this projected local tree then become goal categories and one recursively does
left corner parsing of each. When this local tree is finished, one again recursively does left-corner parsing with
this subtree as the left corner, and the same goal category. To make this description more precise, a Prolog
version of a simple left corner recognizer is shown in Figure 1. This particular parser assumes a rule format for
rule/2 that allows lexical material to appear on the right-hand side of a rule.^3

A common formulation of left corner parsers is in terms of a stack of found and sought constituents, the 
latter being represented as minus categories on the stack (and represented as m(Cat) in the Prolog code). A left 
corner parser that uses a stack is shown in Figure 2. Shifting is now an explicit option on a par with projecting
and attaching, but note that when to shift remains deterministic. If the thing on top of the stack is a predicted 
m(Cat), then one must shift, and one can never successfully shift at other times. This second version of the 
parser more transparently corresponds to the probabilistic language model we employ. 
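
For readers who prefer a non-Prolog rendering, the stack formulation can be sketched in Python. This is an illustrative sketch only, not the implementation used in this paper: the toy grammar and the names RULES, START, goal and recognize are hypothetical, and lexical entries are written as rules with a word on the right-hand side, as in the rule/2 format described above.

```python
# Minimal sketch of a stack-based left-corner recognizer.
# The toy grammar and every name below are hypothetical illustrations.

# Each rule is (LHS, RHS); lexical entries put the word on the right-hand side.
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("Det", ("the",)),
    ("N", ("dog",)),
    ("N", ("cat",)),
    ("V", ("saw",)),
]
START = "S"


def goal(cat):
    """A sought (minus) category, the analogue of m(Cat) in the Prolog code."""
    return ("m", cat)


def is_goal(item):
    # Found categories and words are strings; goals are tuples.
    return isinstance(item, tuple)


def recognize(words, stack=None):
    """Return True if some sequence of shift/project/attach moves succeeds.

    `words` is a tuple of tokens; `stack` is a tuple whose first element is the top.
    """
    if stack is None:
        stack = (goal(START),)
    if not words and not stack:
        return True
    if not stack:
        return False
    top = stack[0]
    if is_goal(top):
        # A goal category is on top: the only possible move is to shift.
        return bool(words) and recognize(words[1:], (words[0],) + stack)
    # attach: the found category matches the goal directly beneath it
    if len(stack) > 1 and stack[1] == goal(top):
        if recognize(words, stack[2:]):
            return True
    # project: try every rule whose left corner is the found category
    for lhs, rhs in RULES:
        if rhs[0] == top:
            new_goals = tuple(goal(c) for c in rhs[1:])
            if recognize(words, new_goals + (lhs,) + stack[1:]):
                return True
    return False
```

As in the Prolog version, the decision to shift is forced by the top of the stack, while attach and project are tried nondeterministically (here by backtracking through recursion). The sketch would not terminate on grammars with unary cycles.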

To produce a language model that reflects the operation of a left corner parser, we have to provide
probabilities for the different operations (the clauses of process in Figure 2). For each step, we need to decide the
probabilities of deciding to shift, attach, or project. The only interesting choice here is deciding whether to 
attach in cases where the left corner category and the goal category are the same. For the other two operations 
of the parser, we need to model the probability of shifting different terminals, and the probability of building a 



2 Note however, that while the obvious way of calculating PCFG probabilities does not allow incremental processing, incremental
calculation is possible, as discussed by [Jelinek et al., 1992].

3 In general, empty categories can be accommodated by allowing a category to be introduced for completion without popping a 
word off the input stack.



% lc(List_of_words_to_parse)
lc(Ws) :- start(C),
    complete_list([C], Ws, []).

complete(C, C, Ws, Ws).                     % attach
complete(W, C, Ws, NewWs) :-                % project lc
    rule(LHS, [W|Rest]),                    % lex / phrase
    complete_list(Rest, Ws, Ws2),
    complete(LHS, C, Ws2, NewWs).

complete_list([], Ws, Ws).
complete_list([C|Cs], [W|Ws], NewWs) :-
    complete(W, C, Ws, Ws2),                % shift
    complete_list(Cs, Ws2, NewWs).

Figure 1: A Prolog LC parser

slc(Ws) :- start(C),
    slc(Ws, [m(C)]).

slc([], []).
slc(L0, Stack0) :-
    process(Stack0, Stack, L0, L),
    slc(L, Stack).

process([A, m(A)|Stack], Stack, L, L).      % attach
process([Item|Items], Stack, L, L) :-       % project LC
    rule(LHS, [Item|Rest]),
    predict(Rest, [LHS|Items], Stack).
process(Stack, [L|Stack], [L|Ls], Ls).      % shift

predict([], L, L).
predict([L|Ls], L2, [m(L)|NewLs]) :-
    predict(Ls, L2, NewLs).

Figure 2: A Prolog LC stack parser

certain local tree given the left corner (lc) and the goal category (gc). Under this model, we have probabilities
for this last operation like this:

$$P(\mathrm{SBar} \to \mathrm{P\ S} \mid lc = \mathrm{P},\ gc = \mathrm{S}) = 0.25$$
$$P(\mathrm{PP} \to \mathrm{P\ NP} \mid lc = \mathrm{P},\ gc = \mathrm{S}) = 0.55$$

How to make probabilities out of the above choices is made precise in the next section.

2.1 The LC probability of a parse

In this section, we provide probabilities for left-corner derivations. These form the basis for a language model 
that assigns probabilities to sentences. For a sentence s, we have that the probability of a sentence according 
to a grammar G is: 

$$P(s \mid G) = \sum_{t} P(s, t \mid G), \quad t \text{ a parse tree of } s$$

$$\phantom{P(s \mid G)} = \sum_{\{t :\ \mathrm{yield}(t) = s\}} P(t \mid G)$$

The last line follows since the parse tree determines the terminal yield. It is therefore sufficient to be able to
calculate the probability of a (parse) tree. Below we suppress the conditioning of the probability according to
the grammar.

Now following the intuition of our model having been inspired by left corner parsing, we can express the 
probability of a parse tree in terms of the probabilities of left corner derivations of that parse tree: 



$$P(t) = \sum_{d \text{ a LC derivation of } t} P(d)$$

But under left corner parsing, each parse tree has a unique derivation and so the summation sign can be dropped 
from this equation. 

Now, without any assumptions, the probability of a derivation can be expressed as a product in terms of the
probabilities of each of the individual operations in the derivation. Suppose that $(C_1, \ldots, C_m)$ is the sequence
of operations in the LC parse derivation d of t. Then, by the chain rule, we have:

$$P(t) = P(d) = \prod_{i=1}^{m} P(C_i \mid C_1, \ldots, C_{i-1})$$

In practice, we cannot condition the probability of each parse decision on the entire history. The simplest 
model, which we will explore for the rest of this section, is to assume that the probability of each parse decision 
is largely independent of the parse history, and just depends on the state of the parser. In particular, we 
will assume that it depends simply on the left corner and top goal categories of the parse stack. This drastic 
assumption nevertheless gives us a slightly richer probabilistic model than a PCFG, because elementary left- 
corner parsing actions are conditioned by the goal category, rather than simply being the probability of a local 
tree. For instance, the probability of a certain expansion of NP may be different in subject position and object 
position, because the goal category is different. 

Each elementary operation of a left corner parser is either a shift, an attach or a left corner projection.
Under the independence assumptions mentioned above, the probability of a shift will simply be the probability
of a certain left corner daughter (lc) being shifted given the current goal category (gc), which we will model
by $P_{\mathrm{shift}}$. Note that when to shift is deterministic. If a goal (i.e., minus) category is on top of the stack (and
hence there is no left corner category), then one must shift. Otherwise one cannot. If one is not shifting, one
must choose to attach or project, which we model by $P_{\mathrm{att}}$. Attaching only has a non-zero probability if the
left corner and the goal category are the same, but we define it for all pairs. If we do not attach, we project a
constituent based on the left corner with probability $P_{lc}$. Thus the probability of each elementary operation $C_i$
can be expressed in terms of probability distributions $P_{\mathrm{shift}}$, $P_{\mathrm{att}}$, and $P_{lc}$ as follows:

$$P(C_i = \text{shift } lc) = \begin{cases} P_{\mathrm{shift}}(lc \mid gc) & \text{if top of the stack is } gc \\ 0 & \text{otherwise} \end{cases}$$

$$P(C_i = \text{attach}) = \begin{cases} P_{\mathrm{att}}(lc, gc) & \text{if top of the stack is not } gc \\ 0 & \text{otherwise} \end{cases}$$

$$P(C_i = \text{project } A \to \gamma) = \begin{cases} (1 - P_{\mathrm{att}}(lc, gc))\, P_{lc}(A \to \gamma \mid lc, gc) & \text{if top of the stack is not } gc \\ 0 & \text{otherwise} \end{cases}$$

Where these operations obey the following:

$$\sum_{lc} P_{\mathrm{shift}}(lc \mid gc) = 1$$

$$\text{If } lc \neq gc,\quad P_{\mathrm{att}}(lc, gc) = 0$$

$$\sum_{\{A \to \gamma :\ \gamma = lc \ldots\}} P_{lc}(A \to \gamma \mid lc, gc) = 1$$

From the above we note that the probabilities of the choice of projections sum to one, and hence, since
other probabilities are complements of each other, the probabilities of the actions available for each elementary
operation sum to one. There are also no dead ends in a derivation, because unless A is a possible left corner
constituent of gc, $P_{lc}(A \to \gamma \mid lc, gc) = 0$. Thus we have shown that these probabilities define a language model.^4
That is, $\sum_s P(s \mid G) = 1$. It is possible to extend the PLCG model in various ways to include more probabilistic
conditioning, as we discuss briefly later, but our current results reflect this model.
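
To make the product over elementary operations concrete, the following Python sketch scores one left-corner derivation under the three distributions. It is an illustration under stated assumptions: every probability value is invented, the tiny grammar (S -> A B, A -> a, B -> b, plus unlisted alternatives that absorb the remaining mass) is hypothetical, and the caller is assumed to supply the left corner and goal category that hold at each step.

```python
import math

# Sketch: probability of a left-corner derivation as a product of
# elementary-operation probabilities. All numbers are invented.

P_SHIFT = {  # P_shift(lc | gc)
    ("a", "S"): 0.6,
    ("c", "S"): 0.4,
    ("b", "B"): 1.0,
}
P_ATT = {  # P_att(lc, gc); missing pairs count as 0, as when lc != gc
    ("S", "S"): 0.9,
    ("B", "B"): 1.0,
}
P_LC = {  # P_lc(A -> gamma | lc, gc), keyed by ((A, gamma), lc, gc)
    (("A", ("a",)), "a", "S"): 1.0,
    (("S", ("A", "B")), "A", "S"): 0.5,
    (("S", ("A", "C")), "A", "S"): 0.5,
    (("B", ("b",)), "b", "B"): 1.0,
}


def op_prob(op, lc, gc):
    """Probability of one elementary operation in parser state (lc, gc)."""
    kind = op[0]
    if kind == "shift":
        return P_SHIFT.get((op[1], gc), 0.0)
    p_att = P_ATT.get((lc, gc), 0.0)
    if kind == "attach":
        return p_att
    if kind == "project":
        return (1.0 - p_att) * P_LC.get((op[1], lc, gc), 0.0)
    raise ValueError(f"unknown operation: {kind}")


def derivation_prob(steps):
    """Multiply the probabilities of (op, lc, gc) steps."""
    return math.prod(op_prob(op, lc, gc) for op, lc, gc in steps)


# Derivation of the string "a b" with start category S.
DERIVATION = [
    (("shift", "a"), None, "S"),
    (("project", ("A", ("a",))), "a", "S"),
    (("project", ("S", ("A", "B"))), "A", "S"),
    (("shift", "b"), None, "B"),
    (("project", ("B", ("b",))), "b", "B"),
    (("attach",), "B", "B"),
    (("attach",), "S", "S"),
]
```

With these invented numbers the only factors below one are the shift of a (0.6), the choice of S -> A B (0.5), and the final attach (0.9), so the derivation receives probability 0.27.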

4 Subject to showing that the probability mass accumulates in finite trees. 

Table 2: Highest frequency CFG rules in Penn Treebank



2-12 word sentences          this paper    Charniak
Grammar (rules)                14 971       10 605
% sent. length < cutoff        16.6%          -
Test set size (sentences)       401           -
Average Length (words)          8.3          8.7
Precision                      89.8%        88.6%
Recall                         90.7%        91.7%
Labelled Precision             83.5%          -
Labelled Recall                82.9%          -
Labelled Precision +1          87.1%          -
Labelled Recall +1             85.2%          -
Average CBs                     0.27          -
Non-crossing accuracy          95.8%        97.9%
Sentences with 0 CBs           84.5%          -

Table 3: PCFG results



3 Parsing experiments 



3.1 PCFG Experiment 



Training and testing were done on Release 2 of the Penn Treebank [Marcus et al., 1993], published in 1995. As
in other recent work [Magerman, 1995, Collins, 1996], training was done on sections 02-21 of the Wall Street
Journal portion of the treebank (approximately 40,000 sentences, 780,153 local trees) and final testing was done
on section 23, which contains 2416 sentences. Counts of how often each local tree occurred in the treebank were
made, and these were used directly to give probabilities for rewriting each nonterminal. The highest frequency
rules are given in Table 2. Local trees were considered down to the level of preterminals (i.e., part of speech
tags); lexical information was ignored.^5 Every tree was given a new root symbol 'ROOT', attached by a unary
branch to the root in the treebank. Empty nodes (of which there are several kinds in the treebank) were ignored,
and nonterminals above them that dominated no pronounced words were also deleted.^6 No attempt was made to
do any smoothing. While in a few cases this would clearly be useful (e.g., the training data allows a compound
noun to be modified by four adjectives, but not a simple noun), in practice the induced treebank grammar is
hugely ambiguous, and greatly overgenerates. Thus, while the lack of smoothing in principle disallows some
correct parses from being generated, the treebank grammar can always produce some parse for a sentence
[Charniak, 1996] and adding unseen rules with low probabilities is unlikely to improve bottom line performance,
because these added rules are unlikely to appear in the maximum probability parse. A better solution would
be to use a covering grammar with fewer rules and a more deeply nested structure.
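
The unsmoothed estimation described above amounts to relative frequencies of local trees. The Python sketch below illustrates this on a three-tree toy treebank. It is not the code used for the experiments: the tuple encoding of trees and the names local_trees and estimate_pcfg are assumptions made for the example.

```python
from collections import Counter

# Sketch: relative-frequency PCFG estimation from local-tree counts.
# A tree is (label, child, ...); a preterminal is (tag, word) with a string word.


def local_trees(tree):
    """Yield (mother, daughters) for every local tree above the preterminals."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return  # preterminal: lexical information is ignored
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from local_trees(child)


def estimate_pcfg(treebank):
    """Map each rule (mother, daughters) to count(rule) / count(mother)."""
    counts = Counter()
    for tree in treebank:
        # Every tree gets a new root symbol attached by a unary branch.
        counts.update(local_trees(("ROOT", tree)))
    totals = Counter()
    for (mother, _), n in counts.items():
        totals[mother] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}


TOY_TREEBANK = [
    ("S", ("NP", ("PRP", "she")), ("VP", ("VBD", "slept"))),
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBD", "barked"))),
    ("S", ("NP", ("PRP", "he")),
          ("VP", ("VBD", "saw"), ("NP", ("DT", "a"), ("NN", "cat")))),
]
PROBS = estimate_pcfg(TOY_TREEBANK)
```

In this toy treebank NP expands as PRP twice and as DT NN twice, so each expansion receives probability 0.5, and no rule absent from the data receives any mass.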

Testing was done by chart-parsing the part of speech tags of the sentences (i.e., ambiguities in part of
speech assignment were assumed to be successfully resolved). An exhaustive chart parse was done and the
highest probability (Viterbi) parse was selected, in the standard way [Charniak, 1993]. Results from such
parsing are shown in Table 3, together with results from [Charniak, 1996]. The measures shown have been used



5 Of course, we could easily integrate our model with a tagging model. 

6 Simply eliminating empties in the treebank is dangerous because they are the only trace of unbounded dependency constructions.
This leads to ridiculous rules like S -> VP (with 23559 appearances in the treebank) stemming from S -> NP VP where there is a
trace subject NP. A purely context-free solution would be to introduce slash percolation.



2/4-40 word sentences         PCFG      Magerman    Collins
% sent. length < cutoff       92.9%        -           -
Test set size (sentences)       -        1759        2416
Average Length (words)        21.9       22.3          -
Precision                     78.8%      86.3%         -
Recall                        80.4%      85.8%         -
Labelled Precision              -        84.5%       86.3%
Labelled Recall                 -        84.0%       85.8%
Average CBs                     -        1.33        1.14
Non-crossing accuracy         87.7%        -           -
Sentences with 0 CBs            -        55.4%       57.2%

Table 4: PCFG [Charniak, 1996] vs. [Magerman, 1995] / [Collins, 1996] comparison



in various earlier works, and generally draw from the PARSEVAL measures [Black et al., 1991]. Precision is the
proportion of brackets in the parse that match those in the correct tree (perhaps also examining labels); recall is the
proportion of brackets in the correct tree that are in the parse. The unmarked measures ignore unary constituents, the
ones marked +1 include unary constituents.^7 Crossing brackets (CBs) measures record how many brackets in
the parse cross bracketings in the correct tree, with the non-crossing accuracy measuring the percent of brackets
that are not CBs. The '% sent. length < cutoff' row says what percentage of sentences within the 2416 sentence
test section were shorter than the cutoff and thus used in the test set for the current experiment. Our results
are not directly comparable to Charniak's since he was using an earlier release of the Penn Treebank. Further,
he used two strategies that aimed at increasing performance: recoding auxiliaries from their Penn tag (which is
undistinguished from other verbs) to special auxiliary tags, and adding a (crude) correction factor to the PCFG
model so that it uniformly favored right-branching trees rather than being context free. Whether the former
change was shown to be beneficial is not discussed, but the latter correction factor improved results by about
two percent. The fact that the results are mixed between the two systems suggests that the quality of the Penn
Treebank has improved in the second release, and these gains roughly match the gains from these factors. It is
useful that the results are roughly comparable since we can then use Charniak's results as a rough benchmark
for PCFG performance on longer sentences, which we have not obtained.^8
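
The unlabelled bracket measures just described can be sketched in a few lines of Python. This is a simplified illustration rather than the scoring code behind the tables: brackets are assumed to be (start, end) word spans held in sets (so duplicate unary brackets collapse), and the function names are hypothetical.

```python
# Sketch: unlabelled bracket precision/recall and crossing brackets.
# Brackets are (start, end) spans over word positions, end exclusive.


def bracket_scores(parse, gold):
    """Return (precision, recall) for two sets of spans."""
    matched = len(parse & gold)
    return matched / len(parse), matched / len(gold)


def crosses(a, b):
    """True if spans a and b overlap without either containing the other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def crossing_brackets(parse, gold):
    """Number of parse brackets that cross at least one gold bracket."""
    return sum(1 for p in parse if any(crosses(p, g) for g in gold))


def non_crossing_accuracy(parse, gold):
    """Fraction of parse brackets that cross no gold bracket."""
    return 1.0 - crossing_brackets(parse, gold) / len(parse)


# A five-word sentence: the parse misgroups words 2-3 against gold 3-4.
GOLD = {(0, 5), (0, 2), (2, 5), (3, 5)}
PARSE = {(0, 5), (0, 2), (2, 5), (2, 4)}
```

Here three of the four parse brackets match, so precision and recall are both 0.75; the bracket (2, 4) crosses the gold bracket (3, 5), giving one crossing bracket and a non-crossing accuracy of 0.75.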

Charniak's central contention is that purely structural parsing like this using treebank grammars works
much better than community lore would have you believe. Indeed, as the comparison in Table 4 suggests, it
does not work much worse than [Collins, 1996], a leading recent parser that includes lexical content. That is,
it seems one can score well in the PARSEVAL measures using purely structural factors, and that the use of
lexical factors in other models is at present only adding a little to their performance.^9 This is in part because
the Penn treebank does not represent certain semantic "attachment" decisions, and the structure of the trees
minimizes the penalty for other "attachment" errors, as we discuss in the last section of this paper.



3.2 LC parsing results 

The probabilistic left corner parser is implemented as a beam parser in C. As a k-best beam parser, it is not
guaranteed to find the best parse, unless the beam is effectively infinite. Space requirements depend on the 
size of the beam, and the length of the sentence, but are considerably more reasonable than those for the chart 
parser used above. Nevertheless, the branching factor in the search space is very high because there are many 

7 The original PARSEVAL measures (because they were designed for comparing different people's parsers that used different theories 
of grammar) ignored node labels entirely, discarded unary brackets, and performed other forms of tree normalization to give special 
treatment to certain cases such as verbal auxiliaries. While such peculiarities made some sense in terms of the purpose for which 
the measures were originally developed, it is not clear they are appropriate for the use of these measures within the statistical NLP 
community. Thus people often report labelled measures, but it is often unclear whether the other rules and transformations of the 
standard have been employed (but this affects the results reported). In this paper, unary nodes are deleted in measures except 
those marked +1, but none of the special case tree transformations in the PARSEVAL standard are applied. All punctuation and all 
constituent labels (but not functional tags) are also retained. 

8 Our PCFG parser builds a complete chart, which leads to unviable space requirements for long sentences. 

9 Again, results are not strictly comparable. The comparison is unfair to Magerman and Collins' systems since they are also 
doing part of speech tagging, whereas the PCFG is not. But on the other hand, Magerman and Collins' parsers conflate the ADVP 
and PRT labels and ignore all punctuation, which improves their reported results. 



                                   Parser results              2-12 Error
Sentence Lengths:              2-12      2-16      2-25        Reduction
Beam size                     50 000    50 000    40 000
% sent. length < cutoff       16.6%     28.1%     60.2%
Test set (sentences)            401       680      1454
Average length (words)          8.3      10.9      16.3
Precision                     92.0%     90.1%     84.6%          21.6%
Recall                        92.3%     89.5%     83.2%          17.2%
Labelled Precision            87.1%     86.0%     81.1%          21.8%
Labelled Recall               86.7%     84.9%     79.6%          22.2%
Labelled Precision +1         88.6%     87.7%     83.5%          11.6%
Labelled Recall +1            88.3%     86.3%     81.5%          20.9%
Average CBs                    0.21      0.43      1.25          22.2%
Non-crossing accuracy         96.8%     94.7%     89.6%          23.8%
Sentences with 0 CBs          87.5%     76.0%     52.0%          19.4%

Table 5: Left Corner Parser results from n-ary grammar.



rules possible with a certain left corner and goal category (especially for a grammar induced from the Penn 
Treebank in the manner we have described). Therefore, a huge beam is needed for good results to be obtained. 
A way of addressing this problem by binarizing the grammar is discussed in the next section. 

The parser maintains two beams, one containing partial LC parses of the first i words, and another in 
which is built a beam of partial LC parses of the first i + 1 words. The partial parses are maintained as pointers 
to positions in trie data structures that represent the list of parser moves to this point and the current parse 
stack. At the end of parsing, the lists of parser moves can be easily turned into parse trees for the n-best parses 
in the beam. 

Results are shown in Table 5. They reflect the same training and testing data as described above. The
results show a small increase in performance for PLCGs. This is shown more dramatically in the right-hand
column of Table 5, which shows that the extra information provided by the Left Corner goal category reduces
parsing errors by about 20% over our PCFG results on 2-12 word sentences.
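
The error reduction figures can be checked with a short calculation. The Python sketch below assumes that "error reduction" means the relative decrease in distance from a perfect score (100% for the accuracy measures, 0 for average crossing brackets); the function name and its `perfect` parameter are hypothetical. Under that reading, the PCFG and PLCG figures reported for 2-12 word sentences reproduce the error reduction column.

```python
# Sketch: relative error reduction between a baseline and an improved score.
# Assumes error is the distance from a perfect score (an interpretation,
# not a definition given in the text).


def error_reduction(baseline, improved, perfect=100.0):
    """Percentage by which the improved system reduces the baseline's error."""
    base_err = abs(perfect - baseline)
    new_err = abs(perfect - improved)
    return 100.0 * (base_err - new_err) / base_err


# PCFG vs. PLCG precision on 2-12 word sentences: 89.8% -> 92.0%
PRECISION_GAIN = round(error_reduction(89.8, 92.0), 1)
# Average crossing brackets, where the perfect score is 0: 0.27 -> 0.21
CB_GAIN = round(error_reduction(0.27, 0.21, perfect=0.0), 1)
```

For precision, the error falls from 10.2 to 8.0 points, a reduction of 21.6%; for average crossing brackets the reduction is 22.2%.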

4 Binarization 

For the parser above, use of a beam search is quite inefficient because, for a given left corner and goal category, 
there are often hundreds of possible local trees that could be projected, and little information is available at 
the time this decision is made, since the decision mainly depends on words that have not yet been shifted. 
Therefore the beam must be large for good results to be obtained, and, at any rate, the branching factor of the 
search space is extremely high, which slows parsing. One could imagine using various heuristics to improve the 
search, but the way we have investigated combatting this problem is by binarizing the grammar. 

The necessary step for binarization is to eliminate productions with three or more daughters. We carried
this out by merging the tails, so that a rule such as NP -> Det JJ NN is replaced by two rules, NP -> Det NP_Det
and NP_Det -> JJ NN. This is carried out recursively until only binary rules remain. As a result of this choice of
binarization, n-ary rules that share the same mother and left corner category all reduce to a single rule. This
greatly cuts the branching factor of the search space and allows decisions to be put off during parsing, until
more of the input has been seen, at which point alternative continuations can be better evaluated. Furthermore,
the weights for such rules all combine into a larger weight for the combined rule.^10
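
The tail-merging scheme can be sketched as follows in Python. This is an illustration, not the code used to build the binarized grammar: the rule encoding, the function names, and in particular the naming of new symbols at deeper levels of recursion (for example NP_Det_JJ) are assumptions, since the text only shows the first level.

```python
from collections import Counter

# Sketch: binarization by merging tails. A rule is (mother, daughters).


def binarize(rule):
    """Split a rule with three or more daughters into binary rules."""
    mother, daughters = rule
    if len(daughters) <= 2:
        return [rule]
    left_corner = daughters[0]
    # New nonterminal named after the mother and the left corner,
    # e.g. NP_Det for NP -> Det ... (deeper names are an assumption).
    new_symbol = f"{mother}_{left_corner}"
    head = (mother, (left_corner, new_symbol))
    return [head] + binarize((new_symbol, daughters[1:]))


def binarize_counts(rule_counts):
    """Binarize every rule, summing the counts of rules that coincide."""
    out = Counter()
    for rule, n in rule_counts.items():
        for binary_rule in binarize(rule):
            out[binary_rule] += n
    return out


COUNTS = {
    ("NP", ("Det", "JJ", "NN")): 3,
    ("NP", ("Det", "NN", "NN")): 2,
    ("NP", ("Det", "NN")): 5,
}
BINARY = binarize_counts(COUNTS)
```

The two ternary rules share mother NP and left corner Det, so both reduce to the single rule NP -> Det NP_Det, whose count is the sum of theirs (5), while the original binary rule NP -> Det NN is left untouched.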

It is important to note that the resulting model is not equivalent to our original model. While the
straightforward way of binarizing a PCFG yields the same probability estimates for trees as the n-ary grammar, this
is not true for our PLCG model since we are now introducing new estimates for shifting terminals for each of
our newly created non-terminals. Slightly different probability estimates result, and further work is needed to
investigate what relationship exists between them and the probability estimates of the original grammar.



10 If we were to do this even for rules that start out binary, and eliminate unary rules downward rather than upward, then for
every left corner C and mother G there will be a unique rule G -> C G_C.







                                        Parser Results                    2-25 Error
Sentence Lengths:              2-12      2-16      2-25      2-40        Reduction
Beam size                     40 000    40 000    40 000    40 000
% sent. length < cutoff       16.3%     28.1%     60.2%     91.8%
Test set (sentences)             -        680      1454      2216
Average length (words)          8.3      10.9      16.3      21.6
Precision                     93.5%     91.4%     86.6%     83.0%          13.0%
Recall                        92.8%     89.9%     84.3%     80.7%           6.5%
Labelled Precision            89.8%     88.1%     83.4%     79.9%          12.2%
Labelled Recall               88.4%     86.3%     81.0%     77.6%           6.9%
Labelled Precision +1         90.0%     89.0%     85.1%     81.9%           9.7%
Labelled Recall +1            89.5%     87.2%     82.6%     79.5%           5.9%
Average CBs                    0.17      0.39      1.09      1.99          12.8%
Non-crossing accuracy         97.2%     95.2%     90.9%     87.6%          12.5%
Sentences with 0 CBs          89.8%     78.2%     55.8%     41.5%           7.9%

Table 6: Left Corner Parser results with binarized grammar



Prior to the above binarization step, one might also wish to eliminate unary productions, much as we earlier 
eliminated empty categories. This can be done in two ways. One way is to fold them upwards. This preserves 
lexical tagging. That is, if there is a category A dominating only a tree rooted at B, then the category A is 
eliminated and the tree rooted at B moved upwards. This may cause the number of rules to increase, because a 
local tree that started with a daughter A will now show up with daughter B in the same place. The alternative 
is to eliminate B and replace it with A. This can also create a new local tree instance because the daughters 
of B now show up with a new mother A. In this way, lexical tags can be changed. For instance, consider a
rule NP -> NNP for a noun phrase rewriting as a proper noun. In the context of eliminating unary productions
upward, we get a new rule S -> NNP VP, whereas by eliminating unary productions downward we would have
produced a new lexical entry NP -> Jones. Our current results do not reflect the elimination of unary rules,
but we hypothesize that doing this would further improve the measures that do not consider unary nodes, while 
probably harming the results on measures that do include unary nodes. 

Table 6 shows that binarization brings a further modest improvement in results. The right-hand column
shows the percent error reduction on 2-25 word sentences between the n-ary grammar and the binary grammar. 



5 Extended Left Corner Models 

More sophisticated PLCG parsing models will naturally provide greater conditioning of the probability of an 
elementary operation based on the parse history. There are a number of ways that one could then proceed. 
From the background of work on LC parsing, a natural factor to consider is the size of the parse stack, and we
will briefly investigate incorporating this factor.

For left corner parsing, the stack size is particularly interesting. [Stabler, 1994] notes that in contrast to
bottom-up and top-down parsing methods, left-corner parsers can handle arbitrary left-branching and
right-branching structures with a finitely bounded stack size. Furthermore, left-corner parses of center embedded
constructions have stack lengths proportional to the amount of embedding. This is not actually true for the
stack-based LC parser presented earlier which has the stack growing without bound for rightward-branching trees.
To gain the desirable property that Stabler notes, one needs to do stack composition by deciding immediately 
whether to attach whenever one projects a category that matches the current goal category, rather than delaying 
the attachment until after the left corner constituent is complete. If one decides to attach, the goal minus 
category and the mother category are immediately removed from the stack. This is implemented by replacing
the predicate process in Figure 2 by the code in Figure 3, where composition is done by the predicate compose.
All else being equal, this change makes no difference to the probabilistic model presented earlier. In practice 
though, this formulation makes a beam search less effective since we are bringing forward the decision of whether 
to attach or not, and often both alternatives must be tried which fills out the beam unnecessarily. 

Given human intolerance for center embeddings and ease of parsing left and right branching structures, we 
would expect the stack sizes to stay low. The general prediction (and empirical fact) is that the probability 



process([Item|Items], Stack, L, L) :-       % project lc
    rule(LHS, [Item|Rest]),
    compose(LHS, Items, Stack1),
    predict(Rest, Stack1, Stack).
process(Stack, [L|Stack], [L|Ls], Ls).      % shift

compose(A, [m(A)|L], L).                    % attach
compose(A, L, [A|L]).                       % don't

Figure 3: Altered code for a Prolog LC parser that does stack composition
5      3745     1393 (37%)     1365 (36%)      987 (26%)
4     17000    12116 (71%)     3491 (21%)     1393 (8%)
3     39105    12241 (31%)    14748 (38%)    12116 (31%)
2    108544    63160 (58%)    33143 (31%)    12241 (11%)
1     71681     8521 (12%)         -         63160 (88%)

Table 7: Changes in stack size for Penn Treebank



of not attaching (composing) decreases slightly with the size of the stack. The counts for the Penn Treebank,
after binarization (including removal of unary rules), are given in Table 7. Once one looks beyond the odd-even
effect in the data, the decreasing probability of not attaching can be clearly seen.

To incorporate stack length into the model, we wish to more accurately predict $P(C_i \mid C_1, \ldots, C_{i-1})$.
Previously histories of parse steps were treated as equivalent according to just the goal category and the left corner (if
present). Now, we are going to additionally differentiate parse histories depending on $\ell$, the length of the stack
after $C_{i-1}$. As before, if the top of the stack is a predicted category, we will shift; otherwise we cannot shift. In
the latter case, we will predict $P_\delta(\delta \mid \ell, lc, gc)$, the probability of various changes in the stack size, based on the
stack size, the left corner, and the goal category.

Let us assume that rules are binary, as discussed above. Then the possible change in the stack length from 
a single elementary operation is between —2 and +1. Given that we are not shifting, we have the following: 

Stack Delta    Rule Type
   -2          unary left corner and attach
   -1          binary left corner and attach
    0          unary left corner (no attach)
   +1          binary left corner (no attach)
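
The four stack deltas follow from simple bookkeeping on the stack, which the Python sketch below makes explicit. It is illustrative only: the stack encoding (a tuple with the top first, goals written as ("m", Cat)) and the names stack_delta and project are assumptions introduced for the example.

```python
# Sketch: stack-length change caused by projecting a rule, with or
# without immediate attachment (composition).


def stack_delta(n_daughters, attach):
    """Change in stack length when projecting a rule with n_daughters."""
    # Projection pops the left corner (-1) and pushes one goal per
    # remaining daughter (n_daughters - 1). Without attachment the mother
    # is pushed (+1); with attachment the matching goal is popped (-1).
    return -1 + (n_daughters - 1) + (-1 if attach else 1)


def project(stack, mother, daughters, attach):
    """Apply a projection to a stack (top first) and return the new stack."""
    lc, rest = stack[0], stack[1:]
    assert daughters[0] == lc, "rule must start with the left corner"
    if attach:
        assert rest and rest[0] == ("m", mother), "no matching goal to attach to"
        rest = rest[1:]
    else:
        rest = (mother,) + rest
    return tuple(("m", d) for d in daughters[1:]) + rest


# A found Det sitting above the goals m(NP) and m(S).
STACK = ("Det", ("m", "NP"), ("m", "S"))
ATTACHED = project(STACK, "NP", ("Det", "N"), attach=True)
UNATTACHED = project(STACK, "NP", ("Det", "N"), attach=False)
```

Projecting the binary rule NP -> Det N with attachment shortens the stack by one, and without attachment lengthens it by one, matching the table.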

The probability of each elementary operation will then be the probability of a certain stack delta (given the
stack size, left corner and goal category) times the probability of a certain rule, given the left corner, the goal
category, and the stack delta. Whether we attach (compose) or not is deterministic given the stack delta, and
so we no longer need to model the $P_{att}$ distribution. Under this model, the probability of different projection
operations (given that we are not in a position to shift) becomes:



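\[
\begin{aligned}
P(C_i = \text{project } A \to lc,\ \text{attach}) &= P_\delta(-2 \mid \ell, lc, gc)\, P_{lc'}(A \to lc \mid lc, gc, \ell, \delta) \\
P(C_i = \text{project } A \to lc,\ \text{do not attach}) &= P_\delta(0 \mid \ell, lc, gc)\, P_{lc'}(A \to lc \mid lc, gc, \ell, \delta) \\
P(C_i = \text{project } A \to lc\ c_2,\ \text{attach}) &= P_\delta(-1 \mid \ell, lc, gc)\, P_{lc'}(A \to lc\ c_2 \mid lc, gc, \ell, \delta) \\
P(C_i = \text{project } A \to lc\ c_2,\ \text{do not attach}) &= P_\delta(+1 \mid \ell, lc, gc)\, P_{lc'}(A \to lc\ c_2 \mid lc, gc, \ell, \delta)
\end{aligned}
\]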
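
As an illustration of this factorization, the following Python sketch estimates both factors by relative frequency from a list of observed projection steps and multiplies them. It is a minimal sketch under stated assumptions: unsmoothed relative-frequency estimation is a choice made here for brevity, and the function name `estimate`, the event format and the toy events are all hypothetical.

```python
from collections import Counter


def estimate(events):
    """Build a probability function from (stack_len, lc, gc, delta, rule) events.

    Uses unsmoothed relative frequencies (an assumption of this sketch).
    """
    context = Counter()     # counts of (stack_len, lc, gc)
    with_delta = Counter()  # counts of (stack_len, lc, gc, delta)
    with_rule = Counter()   # counts of (stack_len, lc, gc, delta, rule)
    for stack_len, lc, gc, delta, rule in events:
        context[stack_len, lc, gc] += 1
        with_delta[stack_len, lc, gc, delta] += 1
        with_rule[stack_len, lc, gc, delta, rule] += 1

    def prob(stack_len, lc, gc, delta, rule):
        seen = with_delta[stack_len, lc, gc, delta]
        if seen == 0:
            return 0.0
        # First factor: probability of the stack delta given the context.
        p_delta = seen / context[stack_len, lc, gc]
        # Second factor: probability of the rule given context and delta.
        p_rule = with_rule[stack_len, lc, gc, delta, rule] / seen
        return p_delta * p_rule

    return prob


# Toy events, invented for illustration only.
events = (
    [(2, "NP", "S", -1, "S -> NP VP")] * 3
    + [(2, "NP", "S", +1, "NP -> NP PP")]
    + [(2, "NP", "S", +1, "NP -> NP SBAR")]
)
prob = estimate(events)
print(prob(2, "NP", "S", -1, "S -> NP VP"))   # 3/5 * 3/3
print(prob(2, "NP", "S", +1, "NP -> NP PP"))  # 2/5 * 1/2
```

Because whether to attach is fixed by the delta, no separate attachment probability appears in the product.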



A variety of other extended models are possible. Note that the model does not need to be uniform, and that
we can estimate different classes of elementary operations using different probabilistic submodels. In particular,
at the level of preterminals, we can incorporate a tagging model. Given the structure of the Penn Treebank,
a terminal $w_j$ is always dominated by a unary rule giving the terminal's part of speech $p_j$. In line with our
basic model above, the choice of part of speech for a word (where the word now counts as the left corner)
will certainly depend on the current goal category. However, we can also condition it on other preceding parse
decisions, in particular on the part of speech of the preceding two words, or, perhaps, in certain circumstances,
on the particular word that preceded. Taking the former possibility, we can say, for cases where $C_i$ involves
predicting a preterminal (through a left corner projection step),

\[
\begin{aligned}
P(C_i \mid C_1, \ldots, C_{i-1}) &= P(p_j \mid w_j, gc, C_1, \ldots, C_{i-1}) \\
&\approx P(p_j \mid w_j, gc, p_{j-2}, p_{j-1})
\end{aligned}
\]

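Assuming - perhaps rashly - independence between the conditioning variables that we have been using before
($lc$, $gc$) and the new ones ($p_{j-1}$, $p_{j-2}$), then we have that:

\[
P(p_j \mid w_j, gc, p_{j-2}, p_{j-1}) = \frac{P(p_j \mid w_j, gc)\, P(p_j \mid p_{j-2}, p_{j-1})}{P(p_j)}
\]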

This is nice in part because we can calculate it using just the statistics previously gathered and a simple trigram
model over POS tags. (An incidental nice property is that the probability is for the POS given the word, and
not the other way round, as in the 'confusing' P(w|t) term of standard Markov model POS taggers.)¹¹
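
A small Python sketch of this combination follows. All probabilities in it are invented for illustration, the function name `combine_tag_scores` is hypothetical, and the final renormalization is an addition of this sketch rather than part of the model described here: with estimated component probabilities the products need not sum to one.

```python
def combine_tag_scores(p_given_word_gc, p_given_prev_tags, p_prior):
    """Score each tag p as P(p | w, gc) * P(p | p_{j-2}, p_{j-1}) / P(p)."""
    return {
        tag: p_given_word_gc[tag] * p_given_prev_tags[tag] / p_prior[tag]
        for tag in p_given_word_gc
    }


# Invented numbers for the word "saw" after a determiner and an adjective.
p_given_word_gc = {"VBD": 0.7, "NN": 0.3}      # from left-corner statistics
p_given_prev_tags = {"VBD": 0.05, "NN": 0.6}   # from a POS trigram model
p_prior = {"VBD": 0.05, "NN": 0.15}            # unigram tag probabilities

scores = combine_tag_scores(p_given_word_gc, p_given_prev_tags, p_prior)

# Renormalize so the scores can be compared as a distribution (sketch only).
total = sum(scores.values())
normalized = {tag: score / total for tag, score in scores.items()}
best = max(normalized, key=normalized.get)
print(normalized, best)
```

In this toy case the trigram context overrides the word's own preference for the verb reading, because the noun tag is far more likely after a determiner and an adjective than its prior would suggest.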



6 Comparison with previous work 



Previous work on non-lexicalized parsing of Penn Treebank data includes [Schabes et al., 1993] and
[Charniak, 1996], but the work perhaps most relevant to our own is that of [Briscoe and Carroll, 1993] and
[Carroll and Briscoe, 1996], which also seeks to address the context-freeness assumption of PCFGs. They
approach the problem by using a probabilistic model based on LR parse tables. Unfortunately, many differences of
approach make meaningful comparisons difficult, and a comparable study of using PLCGs versus Probabilistic
LR parsing remains to be done. Briscoe and Carroll use their Probabilistic LR grammar to guide the actions of
a unification-based parser which uses a hand-built grammar. While we are sympathetic with their desire to use
a more knowledge-based approach, this means that: their language model is deficient, since probability mass
is given to derivations which are ruled out because of unification failures; the coverage of their parser is quite
limited because of limitations of the grammar used; and much time needs to be expended in developing the
grammar, whereas our grammar is acquired automatically (and quickly) from the treebank. Moreover, while
results are not directly comparable, our parsers seem to do rather better on precision and recall than the parser
described in [Carroll and Briscoe, 1996], while performing somewhat worse on crossing brackets measures.
However, Carroll and Briscoe's inferior results probably reflect the fact that the parse trees of their grammar do not
match those of their treebank more than anything else.



7 Observations on why parsing the Penn Treebank is easy 

How is it that the purely structural - and context free even in structural terms - PCFG parser manages 
to perform so well? An important observation is that the measures of precision and recall (labelled or not) 
and crossing brackets are actually quite easy measures to do well on. It is important to notice that they are 
measuring success at the level of individual decisions - and normally what makes NLP hard is that you have 
to make many consecutive decisions correctly to succeed. The overall success rate is then the $n$th power of the
individual decision success rate - a number that easily becomes small.
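
To make the arithmetic concrete, here is a small Python sketch. The per-decision rates and decision counts are illustrative numbers, not figures from the experiments, and the calculation assumes that decisions succeed independently; the function name `overall_success` is hypothetical.

```python
def overall_success(per_decision_rate, n_decisions):
    """Probability that all n independent decisions are made correctly."""
    return per_decision_rate ** n_decisions


# Illustrative values only: a 95% per-decision rate over 20 decisions,
# and a 90% per-decision rate over 10 decisions.
print(round(overall_success(0.95, 20), 3))
print(round(overall_success(0.90, 10), 3))
```

Even a 95% per-decision rate leaves only about a one in three chance of getting twenty consecutive decisions right, whereas per-bracket measures would still report 95%.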

But beyond this, there are a number of features particular to the structure of the Penn Treebank that make
these measures particularly easy. Success on crossing brackets is helped by the fact that Penn Treebank trees
are quite flat. To the extent that sentences have very few brackets in them, the number of crossing brackets
is likely to be small. Identifying troublesome brackets that would lower precision and recall measures is also
avoided. As a concrete instance of this, one difficulty in parsing is deciding the structure of noun compounds
[Lauer, 1995]. Noun compounds of three or more words in length can display any combination of left- or
right-branching structure, as in [[cable modem] manufacturer] vs. [computer [power supply]]. But such fine points
are finessed by the Penn Treebank, which gives a completely flat structure to a noun compound (and any other
pre-head modifiers) as shown below (note that the first example also illustrates the rather questionable Penn
Treebank practice of tagging hyphenated non-final portions of noun compounds as adjectives!).

(NP a/DT stock-index/JJ arbitrage/NN sell/NN program/NN)
(NP a/DT joint/JJ venture/NN advertising/NN agency/NN)

11 Below is a derivation of this equation, written in a simplified form with three variables. We assume b and c are independent,
and the result then follows by using Bayes' rule followed by the definition of conditional probability:

\[
P(a \mid b, c) = \frac{P(a)P(b, c \mid a)}{P(b, c)} = \frac{P(a)P(b \mid a)P(c \mid a, b)}{P(b)P(c \mid b)} = \frac{P(a)P(b \mid a)P(c \mid a)}{P(b)P(c)} = \frac{P(a)P(b, a)P(c, a)}{P(a)^2 P(b)P(c)} = \frac{P(a \mid b)P(a \mid c)}{P(a)}
\]

Penn VP attach       (VP saw (NP the man) (PP with (NP a telescope)))
Penn NP attach       (VP saw (NP (NP the man) (PP with (NP a telescope))))
Another VP attach    (VP saw (NP the (N' man)) (PP with (NP a (N' telescope))))
Another NP attach    (VP saw (NP the (N' man (PP with (NP a (N' telescope))))))

Table 8: Penn trees versus other trees

Another case where peculiarities of the Penn Treebank help is the (somewhat nonstandard) adjunction
structure given to post-noun-head modifiers, of the general form (NP (NP the man) (PP in (NP the moon))).
A well-known parsing ambiguity is whether PPs attach to a preceding NP or VP - or even to a higher preceding
node - and this is one for which lexical or contextual information is clearly much more important than structural
factors [Hindle and Rooth, 1993]. Note now that the use of the above adjunction structure reduces the penalty
for making this decision wrongly. Compare Penn Treebank style structures, and another common structure in
the examples in Table 8. Note the difference in the results:

                                  Errors assessed
           Error                  Prec.    Rec.    CBs

Penn       VP instead of NP       0        1       0
           NP instead of VP       1        0       0

Another    VP instead of NP       1        2       1
           NP instead of VP       2        1       1

The forgivingness of the Penn Treebank scheme is manifest. One can get the attachment wrong and not have
any crossing brackets.¹²
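
The error counts above can be reproduced mechanically from the four bracketings of "saw the man with a telescope". The Python sketch below is a simplified scorer written for this illustration, not an evaluation tool used in the paper: the function names are hypothetical, brackets spanning a single word are ignored unless `include_single_word` is set, and a crossing bracket is counted once per parser bracket that overlaps some gold bracket without nesting.

```python
def parse(text):
    """Parse a bracketed tree string into nested lists [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def brackets(tree, include_single_word=False):
    """Return the set of labelled spans (label, start, end) in a tree."""
    found = set()

    def walk(node, start):
        end = start
        for child in node[1:]:
            end = walk(child, end) if isinstance(child, list) else end + 1
        if include_single_word or end - start > 1:
            found.add((node[0], start, end))
        return end

    walk(tree, 0)
    return found


def errors(parsed, gold, include_single_word=False):
    """Return (precision errors, recall errors, crossing brackets)."""
    p = brackets(parse(parsed), include_single_word)
    g = brackets(parse(gold), include_single_word)

    def crosses(a, b):
        return a[1] < b[1] < a[2] < b[2] or b[1] < a[1] < b[2] < a[2]

    crossing = sum(any(crosses(x, y) for y in g) for x in p)
    return len(p - g), len(g - p), crossing


penn_vp = "(VP saw (NP the man) (PP with (NP a telescope)))"
penn_np = "(VP saw (NP (NP the man) (PP with (NP a telescope))))"
other_vp = "(VP saw (NP the (N' man)) (PP with (NP a (N' telescope))))"
other_np = "(VP saw (NP the (N' man (PP with (NP a (N' telescope))))))"

print(errors(penn_vp, penn_np), errors(penn_np, penn_vp))
print(errors(other_vp, other_np), errors(other_np, other_vp))
```

Calling `errors(other_vp, other_np, include_single_word=True)` also counts the single-word N' brackets, which raises the precision and recall errors to two each.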



8 Conclusions 

This paper explores a new class of probabilistic parsing algorithms for context-free grammars, probabilistic left 
corner grammars. The ability of left corner parsers to support left-to-right online parsing makes them initially 
promising for many tasks. The different conditioning model is slightly richer than that of standard PCFGs, and 
this was shown to bring worthwhile performance improvements over a standard PCFG when used to parse Penn 
Treebank sentences. Beyond this, the model can be extended in various ways, an avenue that we have only just 
begun exploring. Because the left-corner component of the grammar is purely structural, it can be combined 
with other models that include lexical attachment preferences and preferences for basic phrasal chunking (both 
incorporated into Collins' parser). 



12 If one includes unary brackets (recall footnote [?]), then the contrast becomes even more marked, since there would be 2 precision
and recall errors each under the alternative parse trees.



References

[Baker, 1979] Baker, J. K. (1979). Trainable grammars for speech recognition. In Klatt, D. H. and Wolf,
J. J., editors, Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pages
547-550.



[Black et al., 1991] Black, E., Abney, S., Flickinger, D., Gdaniec, C., Grishman, R., Harrison, P., Hindle, D.,
Ingria, R., Jelinek, F., Klavans, J., Liberman, M., Marcus, M., Roukos, S., Santorini, B., and Strzalkowski, T.
(1991). A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings,
Speech and Natural Language Workshop, Pacific Grove, CA, pages 306-311. DARPA.

[Briscoe and Carroll, 1993] Briscoe, T. and Carroll, J. (1993). Generalized probabilistic LR parsing of natural 
language (corpora) with unification-based methods. Computational Linguistics, 19:25-59. 

[Carroll and Briscoe, 1996] Carroll, J. and Briscoe, T. (1996). Apportioning development effort in a probabilistic 
LR parsing system through evaluation. In Proceedings of the Conference on Empirical Methods in Natural 
Language Processing (EMNLP-96), pages 92-100, University of Pennsylvania. 

[Charniak, 1993] Charniak, E. (1993). Statistical Language Learning. MIT Press, Cambridge, MA. 

[Charniak, 1996] Charniak, E. (1996). Tree-bank grammars. Technical Report CS-96-02,
Dept of Computer Science, Brown University.

[Collins, 1996] Collins, M. J. (1996). A new statistical parser based on bigram lexical dependencies. In
Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184-191.

[Demers, 1977] Demers, A. (1977). Generalized left corner parsing. In Proceedings of the Fourth Annual ACM
Symposium on Principles of Programming Languages, pages 170-181.

[Hindle and Rooth, 1993] Hindle, D. and Rooth, M. (1993). Structural ambiguity and lexical relations.
Computational Linguistics, 19:103-120.

[Jelinek et al., 1992] Jelinek, F., Lafferty, J. D., and Mercer, R. L. (1992). Basic methods of probabilistic
context free grammars. In Laface, P. and De Mori, R., editors, Speech Recognition and Understanding:
Recent Advances, Trends, and Applications, volume 75 of Series F: Computer and Systems Sciences. Springer
Verlag.

[Kupiec, 1991] Kupiec, J. (1991). A trellis-based algorithm for estimating the parameters of a hidden stochastic
context-free grammar. In Proceedings of the Speech and Natural Language Workshop, pages 241-246. DARPA.

[Lari and Young, 1990] Lari, K. and Young, S. J. (1990). The estimation of stochastic context-free grammars 
using the inside-outside algorithm. Computer Speech and Language, 4:35-56. 

[Lauer, 1995] Lauer, M. (1995). Designing Statistical Language Learners: Experiments on Noun Compounds. 
PhD thesis, Macquarie University, Sydney, Australia. 

[Magerman, 1995] Magerman, D. M. (1995). Statistical decision-tree models for parsing. In Proceedings of the
33rd Annual Meeting of the Association for Computational Linguistics, pages 276-283.

[Manning, 1996] Manning, C. D. (1996). Ergativity: Argument Structure and Grammatical Relations. CSLI. 

[Marcus et al., 1993] Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated 
corpus of English: The Penn treebank. Computational Linguistics, 19:313-330. 

[Rosenkrantz and Lewis II, 1970] Rosenkrantz, S. J. and Lewis II, P. M. (1970). Deterministic left corner parser.
In IEEE Conference Record of the 11th Annual Symposium on Switching and Automata, pages 139-152.

[Sankoff, 1971] Sankoff, D. (1971). Branching processes with terminal types: applications to context-free
grammars. Journal of Applied Probability, 8:233-240.

[Schabes et al., 1993] Schabes, Y., Roth, M., and Osborne, R. (1993). Parsing the Wall Street Journal with the 
Inside-Outside algorithm. In Proceedings of the Sixth Conference of the European Chapter of the Association 
for Computational Linguistics, pages 341-347, University of Utrecht. 

[Stabler, 1994] Stabler, E. P. (1994). The finite connectivity of linguistic structure. In Clifton, C., Frazier, L.,
and Rayner, K., editors, Perspectives on Sentence Processing, pages 303-336. Lawrence Erlbaum, Hillsdale,
NJ.

[Suppes, 1970] Suppes, P. (1970). Probabilistic grammars for natural languages. Synthese, 22:95-116. 



